2,868 research outputs found
ViZDoom: DRQN with Prioritized Experience Replay, Double-Q Learning, & Snapshot Ensembling
ViZDoom is a robust, first-person shooter reinforcement learning environment,
characterized by a significant degree of latent state information. In this
paper, double-Q learning and prioritized experience replay methods are tested
under a certain ViZDoom combat scenario using a competitive deep recurrent
Q-network (DRQN) architecture. In addition, an ensembling technique known as
snapshot ensembling is employed using a specific annealed learning rate to
observe differences in ensembling efficacy under these two methods. Annealed
learning rates are important in general to the training of deep neural network
models, as they shake up the status-quo and counter a model's tending towards
local optima. While both variants show performance exceeding those of built-in
AI agents of the game, the known stabilizing effects of double-Q learning are
illustrated, and priority experience replay is again validated in its
usefulness by showing immediate results early on in agent development, with the
caveat that value overestimation is accelerated in this case. In addition, some
unique behaviors are observed to develop for priority experience replay (PER)
and double-Q (DDQ) variants, and snapshot ensembling of both PER and DDQ proves
a valuable method for improving performance of the ViZDoom Marine.Comment: 9 pages, 5 figure
Deep Ordinal Reinforcement Learning
Reinforcement learning usually makes use of numerical rewards, which have
nice properties but also come with drawbacks and difficulties. Using rewards on
an ordinal scale (ordinal rewards) is an alternative to numerical rewards that
has received more attention in recent years. In this paper, a general approach
to adapting reinforcement learning problems to the use of ordinal rewards is
presented and motivated. We show how to convert common reinforcement learning
algorithms to an ordinal variation by the example of Q-learning and introduce
Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal
rewards. Additionally, we run evaluations on problems provided by the OpenAI
Gym framework, showing that our ordinal variants exhibit a performance that is
comparable to the numerical variations for a number of problems. We also give
first evidence that our ordinal variant is able to produce better results for
problems with less engineered and simpler-to-design reward signals.Comment: replaced figures for better visibility, added github repository, more
details about source of experimental results, updated target value
calculation for standard and ordinal Deep Q-Networ
Understanding structure of concurrent actions
Whereas most work in reinforcement learning (RL) ignores the structure or relationships between actions, in this paper we show that exploiting structure in the action space can improve sample efficiency during exploration. To show this we focus on concurrent action spaces where the RL agent selects multiple actions per timestep. Concurrent action spaces are challenging to learn in especially if the number of actions is large as this can lead to a combinatorial explosion of the action space.
This paper proposes two methods: a first approach uses implicit structure to perform high-level action elimination using task-invariant actions; a second approach looks for more explicit structure in the form of action clusters. Both methods are context-free, focusing only on an analysis of the action space and show a significant improvement in policy convergence times
The Dreaming Variational Autoencoder for Reinforcement Learning Environments
Reinforcement learning has shown great potential in generalizing over raw
sensory data using only a single neural network for value optimization. There
are several challenges in the current state-of-the-art reinforcement learning
algorithms that prevent them from converging towards the global optima. It is
likely that the solution to these problems lies in short- and long-term
planning, exploration and memory management for reinforcement learning
algorithms. Games are often used to benchmark reinforcement learning algorithms
as they provide a flexible, reproducible, and easy to control environment.
Regardless, few games feature a state-space where results in exploration,
memory, and planning are easily perceived. This paper presents The Dreaming
Variational Autoencoder (DVAE), a neural network based generative modeling
architecture for exploration in environments with sparse feedback. We further
present Deep Maze, a novel and flexible maze engine that challenges DVAE in
partial and fully-observable state-spaces, long-horizon tasks, and
deterministic and stochastic problems. We show initial findings and encourage
further work in reinforcement learning driven by generative exploration.Comment: Best Student Paper Award, Proceedings of the 38th SGAI International
Conference on Artificial Intelligence, Cambridge, UK, 2018, Artificial
Intelligence XXXV, 201
Geometry of Policy Improvement
We investigate the geometry of optimal memoryless time independent decision
making in relation to the amount of information that the acting agent has about
the state of the system. We show that the expected long term reward, discounted
or per time step, is maximized by policies that randomize among at most
actions whenever at most world states are consistent with the agent's
observation. Moreover, we show that the expected reward per time step can be
studied in terms of the expected discounted reward. Our main tool is a
geometric version of the policy improvement lemma, which identifies a
polyhedral cone of policy changes in which the state value function increases
for all states.Comment: 8 page
Crawling in Rogue's dungeons with (partitioned) A3C
Rogue is a famous dungeon-crawling video-game of the 80ies, the ancestor of
its gender. Rogue-like games are known for the necessity to explore partially
observable and always different randomly-generated labyrinths, preventing any
form of level replay. As such, they serve as a very natural and challenging
task for reinforcement learning, requiring the acquisition of complex,
non-reactive behaviors involving memory and planning. In this article we show
how, exploiting a version of A3C partitioned on different situations, the agent
is able to reach the stairs and descend to the next level in 98% of cases.Comment: Accepted at the Fourth International Conference on Machine Learning,
Optimization, and Data Science (LOD 2018
Improving Search through A3C Reinforcement Learning based Conversational Agent
We develop a reinforcement learning based search assistant which can assist
users through a set of actions and sequence of interactions to enable them
realize their intent. Our approach caters to subjective search where the user
is seeking digital assets such as images which is fundamentally different from
the tasks which have objective and limited search modalities. Labeled
conversational data is generally not available in such search tasks and
training the agent through human interactions can be time consuming. We propose
a stochastic virtual user which impersonates a real user and can be used to
sample user behavior efficiently to train the agent which accelerates the
bootstrapping of the agent. We develop A3C algorithm based context preserving
architecture which enables the agent to provide contextual assistance to the
user. We compare the A3C agent with Q-learning and evaluate its performance on
average rewards and state values it obtains with the virtual user in validation
episodes. Our experiments show that the agent learns to achieve higher rewards
and better states.Comment: 17 pages, 7 figure
Multi-agent Hierarchical Reinforcement Learning with Dynamic Termination
In a multi-agent system, an agent's optimal policy will typically depend on
the policies chosen by others. Therefore, a key issue in multi-agent systems
research is that of predicting the behaviours of others, and responding
promptly to changes in such behaviours. One obvious possibility is for each
agent to broadcast their current intention, for example, the currently executed
option in a hierarchical reinforcement learning framework. However, this
approach results in inflexibility of agents if options have an extended
duration and are dynamic. While adjusting the executed option at each step
improves flexibility from a single-agent perspective, frequent changes in
options can induce inconsistency between an agent's actual behaviour and its
broadcast intention. In order to balance flexibility and predictability, we
propose a dynamic termination Bellman equation that allows the agents to
flexibly terminate their options. We evaluate our model empirically on a set of
multi-agent pursuit and taxi tasks, and show that our agents learn to adapt
flexibly across scenarios that require different termination behaviours.Comment: PRICAI 201
Hi-Val: Iterative Learning of Hierarchical Value Functions for Policy Generation
Task decomposition is effective in manifold applications where the global complexity of a problem makes planning and decision-making too demanding. This is true, for example, in high-dimensional robotics domains, where (1) unpredictabilities and modeling limitations typically prevent the manual specification of robust behaviors, and (2) learning an action policy is challenging due to the curse of dimensionality. In this work, we borrow the concept of Hierarchical Task Networks (HTNs) to decompose the learning procedure, and we exploit Upper Confidence Tree (UCT) search to introduce HOP, a novel iterative algorithm for hierarchical optimistic planning with learned value functions. To obtain better generalization and generate policies, HOP simultaneously learns and uses action values. These are used to formalize constraints within the search space and to reduce the dimensionality of the problem. We evaluate our algorithm both on a fetching task using a simulated 7-DOF KUKA light weight arm and, on a pick and delivery task with a Pioneer robot
ContextVP: Fully Context-Aware Video Prediction
Video prediction models based on convolutional networks, recurrent networks,
and their combinations often result in blurry predictions. We identify an
important contributing factor for imprecise predictions that has not been
studied adequately in the literature: blind spots, i.e., lack of access to all
relevant past information for accurately predicting the future. To address this
issue, we introduce a fully context-aware architecture that captures the entire
available past context for each pixel using Parallel Multi-Dimensional LSTM
units and aggregates it using blending units. Our model outperforms a strong
baseline network of 20 recurrent convolutional layers and yields
state-of-the-art performance for next step prediction on three challenging
real-world video datasets: Human 3.6M, Caltech Pedestrian, and UCF-101.
Moreover, it does so with fewer parameters than several recently proposed
models, and does not rely on deep convolutional networks, multi-scale
architectures, separation of background and foreground modeling, motion flow
learning, or adversarial training. These results highlight that full awareness
of past context is of crucial importance for video prediction.Comment: 19 pages. ECCV 2018 oral presentation. Project webpage is at
https://wonmin-byeon.github.io/publication/2018-ecc
- …